Systematic Biology
Oxford University Press (OUP)
Preprints posted in the last 30 days, ranked by how well they match Systematic Biology's content profile, based on 121 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit.
Milkey, A.; Chen, J.; Lewis, P. O.
As modern phylogenomics datasets become increasingly large, it is useful to develop recommendations for how to subsample datasets for best species tree inference. Here we apply a new measure of phylogenetic information content that estimates the reduction in tree space occupied by a posterior sample of inferred trees relative to a prior sample in order to assess the effects of gene tree parameters on species tree estimation. We find that, consistent with earlier studies, when data are informative, more data result in better species tree inference. However, when data are uninformative, subsampling a dataset to include only the most informative loci may produce a better species tree sample. We perform analyses on a variety of simulated and empirical datasets.
Nagel, A. A.; Landis, M. J.
Ancestral state reconstruction is a classical problem of broad relevance in phylogenetics. Likelihood-based methods for reconstructing ancestral states under discrete character models, such as Markov models, have proven extremely useful, but only work so long as the assumed model yields a tractable likelihood function. Unfortunately, extending a simple but tractable phylogenetic model to possess new, but biologically realistic, properties often results in an intractable likelihood, preventing its use in standard modeling tasks, including ancestral state reconstruction. The rapid advancement of deep learning offers a potential alternative to likelihood-based inference of ancestral states, particularly for models with intractable likelihoods. In this study, we modify the phylogenetic deep learning software phyddle to conduct ancestral state reconstruction. We evaluate phyddle's performance under various methodological and modeling conditions, comparing it to Bayesian inference when possible. For simple models and small trees, its performance resembles that of Bayesian inference, but worsens as tree size increases. While phyddle still performs adequately for more complex models, such as speciation and extinction models, its estimates differ more from Bayesian inference than they do under simpler models. Lastly, we use phyddle to infer ancestral states for two empirical datasets: ancestral ranges of a subclade of the genus Liolaemus, and ancestral locations for sequences from the 2014 Sierra Leone Ebola virus disease outbreak.
Milkey, A.; Lewis, P. O.
A new Bayesian measure of phylogenetic information content is introduced based on geodesic distances in treespace. The measure is based on the relative variance of phylogenetic trees sampled from the posterior distribution compared to the prior distribution. This ratio is expected to equal 1 if there is no information in the data about phylogeny and 0 if there is complete information. Trees can be scaled to have the same mean tree length to avoid dominance by edge length information and focus on topological information. The method scales well, requiring only that a valid sample can be obtained from both prior and posterior distributions. We show how dissonance (information conflict) among data sets can also be estimated. Both simulated and empirical examples are provided to illustrate that the new approach produces sensible and intuitive results.
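To make the ratio described in this abstract concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes the pairwise distances between sampled trees (for example, geodesic distances) have already been computed by other software and are supplied as symmetric matrices, and it uses half the mean squared pairwise distance as the variance estimate, which is a simplification chosen here. The function names are hypothetical.

```python
def sample_variance(dist):
    """Half the mean squared pairwise distance among sampled trees.

    dist is a symmetric matrix (list of lists) of precomputed tree-to-tree
    distances; this choice of variance estimator is an assumption.
    """
    n = len(dist)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(dist[i][j] ** 2 for i, j in pairs) / (2 * len(pairs))


def variance_ratio(prior_dist, posterior_dist):
    """Posterior variance divided by prior variance.

    Values near 1 suggest the data add little information about the tree;
    values near 0 suggest the posterior is far more concentrated than the prior.
    """
    return sample_variance(posterior_dist) / sample_variance(prior_dist)


# Toy distance matrices for three sampled trees each (made-up numbers).
prior = [[0, 2, 2], [2, 0, 2], [2, 2, 0]]
posterior = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
ratio = variance_ratio(prior, posterior)
```

In this toy case the posterior trees sit half as far apart as the prior trees, so the ratio drops well below 1.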
De Maio, N.
Maximum likelihood phylogenetic methods are popular approaches for estimating evolutionary histories. These methods do not assume prior hypotheses regarding the shape of the phylogenetic tree, and this lack of prior assumptions can be useful, particularly in cases of idiosyncratic sampling patterns. For example, the rate at which species are sequenced can differ widely between lineages, with lineages of more interest to humans usually being sequenced more often than others. However, in some settings sampling can be lineage-agnostic. In genomic epidemiology, for example, the sequencing rate can change through time or across locations, but is often agnostic to the specific pathogen strain being sequenced. In this scenario, one expects that the abundance of a pathogen strain at a specific time and location in the host population is reflected in the relative abundance of that strain among the genomes sequenced at that time and location. Here, I show that this simple assumption, when appropriate and incorporated within maximum likelihood phylogenetics, can greatly improve the accuracy of phylogenetic inference. This is similar to the famous medical principle "when you hear hoofbeats, think of horses, not zebras". In this application it means that, when for example observing a (possibly incomplete) genome sequence that has a similar likelihood of belonging to multiple different strains, I aim to prioritize phylogenetic placement onto a common strain (the "horse", a common disease) rather than a rare one (the "zebra", a rare disease). I introduce and assess two separate approaches to achieve this. The first approach rescales the likelihood of a phylogenetic tree by the number of distinct binary topologies obtainable by arbitrarily resolving multifurcations in the tree.
This approach is based on a new interpretation of multifurcating phylogenetic trees particularly relevant at low divergence: multifurcations represent a lack of signal for resolving the bifurcating topology rather than an instantaneous multifurcating event, and so a multifurcating tree is interpreted as the set of bifurcating trees consistent with the multifurcating one, rather than as a single multifurcating topology. The second approach instead includes a tree prior that assumes that genomes are sequenced at a rate proportional to their abundance. Both approaches favor phylogenetic placement at abundant lineages, and using simulations I show that both methods dramatically improve the accuracy of phylogenetic inference in scenarios like SARS-CoV-2 phylogenetics, where large multifurcations are common. This considerable impact is also observed in real pandemic-scale SARS-CoV-2 genome data, where accounting for lineage prevalence reduces phylogenetic uncertainty by around one order of magnitude. Both approaches were implemented as part of the free and open source phylogenetic software MAPLE v0.7.5.4 (https://github.com/NicolaDM/MAPLE).
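The count behind the first approach can be sketched as follows. A node with k children can be resolved into (2k-3)!! distinct rooted binary subtrees, and resolutions at different nodes combine multiplicatively. This is a hedged illustration only: the abstract does not say how MAPLE counts resolutions or applies the rescaling, so the rooted counting and the function names below are assumptions.

```python
def odd_double_factorial(m):
    """Return m * (m - 2) * ... * 1 for odd m, and 1 when m <= 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def resolutions_of_node(num_children):
    """Rooted binary resolutions of one node with num_children children: (2k-3)!!."""
    return odd_double_factorial(2 * num_children - 3)


def count_binary_resolutions(children_per_node):
    """Binary topologies consistent with a multifurcating rooted tree.

    children_per_node lists the number of children of each internal node
    (hypothetical input format); bifurcating nodes contribute a factor of 1.
    """
    total = 1
    for k in children_per_node:
        total *= resolutions_of_node(k)
    return total


# A tree with one bifurcation, one trifurcation and one 4-way multifurcation.
n_topologies = count_binary_resolutions([2, 3, 4])
```

Because the count grows super-exponentially with the size of a multifurcation, a practical implementation would likely work with its logarithm.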
Marchand, B.; Tahiri, N.; Tremblay-Savard, O.; Lafond, M.
Phylogenetic networks are widespread representations of evolutionary histories for taxa that undergo hybridization or lateral gene transfer (LGT) events. There are now many tools to reconstruct such networks, but no clearly established metric to compare them. Such metrics are needed, for example, to evaluate predictions against a simulated ground truth. Despite years of effort in developing metrics, known dissimilarity measures either do not distinguish all pairs of different networks, or are extremely difficult to compute. Since it appears challenging, if not impossible, to create the ideal metric for all classes of networks, it may be relevant to design them for specialized applications. In this article, we introduce a metric on LGT networks, which consist of trees with additional arcs that represent lateral gene transfer events. Our metric is based on edit operations, namely the addition/removal of transfer arcs, and the contraction/expansion of arcs of the base tree, allowing it to connect the space of all LGT networks. We show that it is linear-time computable if the order of transfers along a branch is unconstrained but NP-hard otherwise, in which case we provide a fixed-parameter tractable (FPT) algorithm in the level. We implemented our algorithms and demonstrate their applicability on three numerical experiments. Full online version: https://www.biorxiv.org/content/10.1101/2025.11.20.689557
Feigin, C. Y.; Trybulec, E.; Ferguson, R.; Scicluna, E. L.; Sauermann, R.; Hartley, G. A.; O'Neill, R. J.; Pask, A. J.
Small marsupials in the family Dasyuridae are a key component of Australia's arid and semi-arid fauna, whose high species richness is proposed to reflect an opportunity-driven adaptive radiation. Despite growing interest in this group from both ecological and evolutionary perspectives, genomic data for most species are non-existent, or limited to a few marker loci. Here, we generated a chromosome-level reference genome and a de novo mitochondrial genome for the desert-dwelling Wongai ningaui (Ningaui ridei). The nuclear genome assembly is highly contiguous, with a scaffold N50 of 594.484 Mb and high BUSCO gene recovery (93.84%). Additionally, we produced a draft assembly for the related, semi-arid slender-tailed dunnart (Sminthopsis murina). We then used these assemblies to explore the demographic histories of these species. We find evidence for contrasting patterns of population growth during the late Pleistocene and early Holocene, corresponding with differences in local climate, potentially consistent with differences in optimal habitat. The new genomic resources and demographic findings presented here provide a foundation for future studies on adaptive specialisation in this group of Australian marsupials. Significance Statement: Dasyurid marsupials are the primary carnivorous and insectivorous mammals in Australia. This diverse family includes species such as the endangered Tasmanian devil (Sarcophilus harrisii) and quolls (genus Dasyurus), as well as an emerging laboratory model species, the fat-tailed dunnart (Sminthopsis crassicaudata). Despite the species richness within dasyurids, most species remain under-studied. This is particularly true of arid and semi-arid zone species, which are often small in size, live in remote habitats and are cryptic by nature. By creating genome assemblies for two dasyurid species, this study provides resources to support a variety of phylogenetic, population genetic and evolutionary developmental lines of research.
Importantly, the study's finding that arid and semi-arid dasyurids show distinct trajectories of demographic change in response to historical climatic shifts may point to local adaptations with implications for the resilience of these species to ongoing and future climate change.
Leroy, R. B.; Eme, L.; Lopez-Garcia, P.; Moreira, D.
Understanding the phylogenetic relationships among eukaryotic lineages is essential for tracing the evolution of key phenotypic traits and inferring the nature of the Last Eukaryotic Common Ancestor. While phylogenomic analyses have clustered eukaryotic taxa into several well-supported major supergroups, the relationships among them remain largely uncertain. Phylogenetic signal erosion over deep time and limited available taxon sampling are among the possible causes. However, most previous studies rely on variations of the same core protein dataset, hence containing the same potential systematic biases. Here, we reconstructed the eukaryotic Tree of Life using a largely independent, marker-rich dataset derived from highly conserved Benchmarking Universal Single-Copy Orthologs. Unlike previous collections, our 277-marker supermatrix minimizes ribosomal protein representation and shares less than 25% overlap with previous datasets. State-of-the-art analyses of this dataset confirm most eukaryotic supergroups previously observed, but suggest different positions for some lineages. Notably, Telonemia clusters with Haptophyta rather than SAR (Stramenopiles-Alveolata-Rhizaria), and Ancyromonadida and Malawimonadida form a monophyletic group at the base of the Opimoda. Our results highlight the importance of analyzing independent phylogenomic datasets and support the hypothesis that extant eukaryotic diversity encompasses a small number of high-rank, supergroup lineages.
Dornburg, A.; Su, Z. T.; Jin, Y.; Fisk, N.; Townsend, J. P.
Phylogenomic datasets assembled to resolve the Tree of Life now routinely span thousands of loci comprising millions of characters. Yet the persistence of incongruent topologies across such datasets reveals a fundamental truth of phylogenetics: not all data are equally informative. Here we derive analytical approaches that predict the relative impacts of phylogenetic signal, stochastic noise, and systematic bias on phylogenetic inference. We show that these three components exhibit divergent scaling properties with character sampling: signal and bias accumulate linearly, while noise accumulates nonlinearly with a concave trajectory. For some phylogenetic problems, substantial amounts of phylogenetic noise may eventually be overwhelmed by signal. For other phylogenetic problems--especially those involving deep divergences, short internodes, or constrained character-state space--the slope of signal accumulation can be so shallow that even signal from genome-scale data may never practically exceed noise. Moreover, linear accumulation of phylogenetic bias can in principle continuously overwhelm accumulation of signal at a lower slope with additional characters, regardless of dataset size. Applying our theory to empirical datasets, we show that anchored hybrid enrichment and ultraconserved element loci, like any loci, can exhibit signal that is overwhelmed by noise, and that character acquisition biases in some loci can further confound inference. Given the pervasive nature of incongruence in the phylogenomic era, our work provides a theoretical foundation for understanding the limits of inference, improving experimental design, and guiding efficient and accurate resolution of the Tree of Life.
Leache, A.; Davis, H.; Guerra, E.; Herrera, A.; Lemos-Espinal, J.; Fujita, M.; Myers, T. C.; Singhal, S.
Species delimitation is a fundamental challenge in systematic biology, particularly for geographically variable taxa with hierarchical population structure and gene flow. Migration-aware coalescent models provide a powerful framework for investigating lineage divergence and accurately defining species boundaries. In this study, we combine statistical evaluations of gene flow with phylogenetic and population structure analyses to delimit species of fence lizards within the Sceloporus undulatus complex, a group characterized by extensive population subdivision, mitochondrial DNA introgression, and nuclear gene flow. We find that the undulatus complex exhibits uneven variation in genetic, morphological, and bioclimatic traits, resulting in variable distinctiveness among groups. In some cases, species boundaries are recognized by clear genetic discontinuities without gene flow. In others, shallow divergence, paraphyly, and gene flow produce leaky boundaries and fuzzy species limits. Mitochondrial introgression is extensive and concentrated at species boundaries, whereas nuclear gene flow occurs between only a few species and at much lower levels than within species. Neither within-species populations nor species are substantially diverged in morphology or bioclimatic space, highlighting the limited utility of these traits for diagnosing species in this group. By integrating estimates of gene flow with phylogenetic and population structure analyses, this study provides a robust and biologically meaningful revised taxonomic framework for the undulatus complex that identifies independently evolving lineages as species.
Koshkarov, A.; Tahiri, N.
Phylogenetic trees represent the evolutionary histories of taxa and support tasks such as clustering and Tree of Life reconstruction. Many established comparison methods, including the Robinson-Foulds (RF) distance, assume identical taxon sets. A methodological gap remains for trees with distinct but overlapping taxa. Existing approaches either prune non-common leaves, which can discard information, or complete both trees such that they share the same taxa. Completion is more comprehensive, but current methods typically ignore branch lengths, which are essential for identifying evolutionary patterns. This paper introduces k-Nearest Common Leaves (k-NCL), an algorithm for completing rooted phylogenetic trees defined on different but overlapping taxa. The method uses branch lengths and topological characteristics and does not rely on a specific distance measure. The k-NCL algorithm is designed to preserve evolutionary relationships in the trees under comparison. The running time is O(n^2), where n is the size of the union of the two leaf sets. Additional properties include preservation of original distances and topology, symmetry, and uniqueness of the completion. Implemented in Python, k-NCL is evaluated on biological datasets of amphibians, birds, mammals, and sharks. Experimental results show that RF combined with k-NCL improves phylogenetic tree clustering performance compared to the RF(+) tree completion approach. Availability and implementation: An open-source implementation of k-NCL in Python and the datasets used in this study are available at https://github.com/tahiri-lab/KNCL.
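The identical-taxon-set assumption mentioned in this abstract shows up directly when the RF distance is written out. The sketch below computes the RF distance between two rooted trees as the number of clusters found in one tree but not the other. It is a generic illustration, not the k-NCL code, and the nested-tuple tree encoding and function names are assumptions made here.

```python
def clusters(tree):
    """Return (leaf set, set of clusters) for a rooted tree.

    A tree is a leaf name (str) or a tuple of subtrees; this encoding is a
    hypothetical convenience, not a format used by the paper.
    """
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clusters = clusters(child)
        leaves |= child_leaves
        found |= child_clusters
    found.add(leaves)  # the cluster defined by this internal node
    return leaves, found


def rf_distance(tree_a, tree_b):
    """Size of the symmetric difference of the two trees' cluster sets."""
    leaves_a, clusters_a = clusters(tree_a)
    leaves_b, clusters_b = clusters(tree_b)
    if leaves_a != leaves_b:
        # This is the gap that pruning or tree completion is meant to fill.
        raise ValueError("RF distance requires identical taxon sets")
    return len(clusters_a ^ clusters_b)


t1 = (("A", "B"), ("C", "D"))
t2 = (("A", "C"), ("B", "D"))
d = rf_distance(t1, t2)
```

Calling `rf_distance` on trees with different leaves raises an error, which is why such trees must first be pruned or completed to a shared taxon set.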
Leone, M.; Rech De Laval, V.; Drage, H. B.; Waterhouse, R. M.; Robinson-Rechavi, M.
Integrating taxonomic data from various sources presents a significant challenge in biodiversity research, due to non-standardized nomenclature and evolving species classifications. Discrepancies between major repositories like the Global Biodiversity Information Facility (GBIF) and the National Center for Biotechnology Information (NCBI), as well as citizen science platforms such as iNaturalist, lead to fragmented and sometimes inaccurate biological data. We present TaxonMatch, a tool designed to address these challenges. TaxonMatch aligns taxonomic names, resolves synonymy, and corrects typographical and structural inconsistencies across databases. We show how it can be used to build a common backbone arthropod taxonomy over NCBI, GBIF and iNaturalist, to find the closest molecular data to a given fossil, and to identify IUCN endangered species with molecular data. TaxonMatch provides a cohesive taxonomic framework and a consistent taxonomic backbone, and can be applied to any taxonomic source. The tool is available at https://github.com/MoultDB/TaxonMatch.
Soares, L. S.; Fagundes, N. R.; Bombarely, A.; Freitas, L. B.
The remarkable diversity of life on Earth results from evolutionary processes functioning across different spatial and temporal scales. Species diversification occurs through various mechanisms and at widely varying rates, but identifying the conditions that trigger bursts of diversification over short timescales remains a central challenge in evolutionary biology. This difficulty is more pronounced when incomplete lineage sorting (ILS), hybridization, and ongoing gene flow obscure evolutionary relationships and complicate species delimitation. In this study, we investigated the evolutionary history and species boundaries within a group of recently diverged Petunia lineages shaped by pervasive gene flow. We integrated phylogenomic, population genetic, and species delimitation approaches to reconstruct lineage relationships and assess whether these lineages represent distinct species or stages along a speciation continuum. By applying methods that account for both ILS and gene flow, we revealed that most lineages are not fully independent evolutionary units but rather occupy intermediate positions along this continuum. Gene flow played a crucial role during diversification, blurring species boundaries and generating reticulate evolutionary patterns. Our findings demonstrate that traditional phylogenetic trees may oversimplify relationships in such systems, while phylogenetic networks offer a more accurate representation of evolutionary history. Comprehensive and integrative analyses, such as those employed here, are essential for capturing these complex dynamics. Ultimately, only four lineages could be confidently recognized as distinct species, whereas the remaining represent cases of ongoing divergence. These results emphasize the need to refine species delimitation frameworks for systems characterized by recent divergence and extensive reticulation.
Rosenbaum, S.; Grebe, N.; Silk, J. B.
Understanding the distribution of paternity within social groups is critical for testing hypotheses about the evolution of behavior and morphology in primates, but assembling the requisite comparative data is a challenging task. We compiled genetic paternity data from 52 species of wild nonhuman primates along with information about socioecological, morphological, and life history traits that are relevant to understanding what proportion of offspring are sired by primary males (i.e., alpha males in multi-male groups and resident males in single-male groups). Our dataset, which currently contains information about 11 primate families and >3,000 individual paternities, is presented as a publicly accessible, living database designed to be updated as new data become available. Using Bayesian regression models, we investigated the role that phylogeny, group composition, and seasonality play in determining primary males' paternity share, and assessed the relative share of paternities obtained by non-primary residents versus extra-group males. First, we found that phylogeny has a detectable but relatively modest influence on primary males' paternity share. Species-level differences explained roughly 35-40% of variation in primary males' paternity share, and of that interspecific variation, ~50-70% was attributable to shared phylogenetic history. Second, group composition strongly predicted paternity share outcomes. Primary males in single-male/multi-female groups obtained the highest share of paternity (~80%), while those in multi-male groups had the lowest (~60%), though there was substantial variation within each category. Pair-living animals showed a striking split: males in cohesive pairs sired ~90% of offspring, while those in dispersed pairs sired only ~55%. Contrary to expectations, reproductive seasonality did not predict primary males' paternity share in any group type.
Finally, when primary males in multi-male groups lost paternities, ~75% of losses were to other resident males. Overall, ~5-15% of offspring in these groups were sired by extra-group males. Our results largely confirm earlier findings based on smaller datasets, but also show that the relationship between social organization and paternity is more complicated than simple categorical predictions suggest. We discuss the gap between the data that would ideally be available for testing these hypotheses versus what currently exists, with hopes that our living database can help close this gap over time.
Wu, H.; Medvedev, P.
Estimating mutation rates between evolutionarily related sequences is a central problem in molecular evolution. Due to the rapid expansion of datasets, modern methods avoid costly alignment and instead focus on comparing sketches of sets of constituent k-mers. While these methods perform well on many sequences, they are not robust to highly repetitive sequences such as centromeres. In this paper, we present three new estimators that are robust to the presence of repeats. The estimators are applicable in different settings, based on whether they need count information from zero, one, or both of the sequences. We evaluate our estimators empirically using highly repetitive alpha satellite sequences. Our estimators each perform best in their class and our strongest estimator outperforms all other tested estimators. Our software is open-source and freely available at https://github.com/medvedevgroup/Accurate_repeat-aware_kmer_based_estimator.
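For context, the kind of alignment-free estimate this abstract refers to can be sketched in a few lines. If each position mutates independently with probability r, a k-mer survives unchanged with probability (1 - r)^k, so the fraction q of shared k-mers gives r ≈ 1 - q^(1/k). The sketch below is this naive set-based estimator only. It is not one of the paper's three estimators, it ignores k-mer counts, and it is therefore expected to mislead on repetitive sequences. The function names are hypothetical.

```python
def kmer_set(seq, k):
    """All distinct substrings of length k in seq (multiplicity is discarded)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def naive_mutation_rate(source, mutated, k):
    """Estimate the per-base mutation rate from the fraction of shared k-mers.

    Assumes independent substitutions and no repeats; with repeats, distinct
    genomic positions collapse into one k-mer and the estimate becomes biased.
    """
    source_kmers = kmer_set(source, k)
    if not source_kmers:
        raise ValueError("source sequence is shorter than k")
    shared_fraction = len(source_kmers & kmer_set(mutated, k)) / len(source_kmers)
    return 1.0 - shared_fraction ** (1.0 / k)


# Toy sequences: one of the three 3-mers of the source survives.
rate = naive_mutation_rate("ACGTT", "ACGAA", 3)
```

Discarding multiplicity in `kmer_set` is exactly where repeat information is lost, which is the weakness that count-aware estimators are meant to address.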
Tiatragul, S.; Brennan, I. G.; Skeels, A.; Zozaya, S. M.; Esquerre, D.; Keogh, J. S.; Pepper, M.
Continental radiations record the long-term interplay between environmental change, ecological opportunity, and lineage diversification across large geographic scales. The gecko family Diplodactylidae represents one such radiation with ~200 species distributed across Australia, New Caledonia, and Aotearoa New Zealand, occupying ecological forms ranging from burrow-dwelling desert specialists to canopy climbers, and diversifying over a ~45 Ma history shaped by dramatic continental environmental change. Using ~5000 nuclear loci, we reconstructed phylogenetic relationships and divergence times, estimated ancestral ecology and biomes, and modeled the effects of habitat use on diversification and morphology. Crown diplodactylids originated in the mid-Eocene (~45 Ma), with the core Australian clade radiating in the Oligocene (~28 Ma), substantially younger than previous estimates. Ancestral state estimation indicated arboreal origins in mesic environments, followed by repeated transitions into open habitats and expansion into semi-arid and arid biomes. Diversification rates vary with habitat use, but differences were moderate. Size varies with habitat use, but tail morphology is phylogenetically conserved despite dominating overall variation. These patterns indicate that environmental change and biome transformation generated ecological opportunity, promoting diversification through repeated habitat transitions and morphological divergence, providing a macroevolutionary framework linking environmental change, ecological expansion, and trait evolution in a continental radiation.
Howard, L.; Wagner, P. J.
Paleobiologists commonly use genera as a proxy for species in biodiversity studies. However, a lingering concern is that patterns among genera might not always faithfully reflect patterns among species. To date, the concern has focused chiefly on measured patterns of richness over time and on implied origination and extinction rates. However, similar issues might arise for studies of morphological disparity. Moreover, there potentially are additional implications of disparity patterns among species versus those among genera concerning the range of observable anatomical characters and whether disparity within genera is comparable to disparity among genera. If clades have some relatively slowly changing characters that workers have used to denote different genera, then we would expect congeneric species to cluster in morphospace; however, if such characters are rare, then within-genus disparity might approach among-genus disparity. Here, we use genus-level and species-level disparity patterns among acanthoceratid ammonoids from the Late Cretaceous. In particular, we examine whether these different levels imply different evolutionary dynamics over a major ecological event (Ocean Anoxic Event 2, OAE2) and how disparity within genera (i.e., among congeneric species) compares to disparity among genera. We find genus-level disparity somewhat inflates early acanthoceratid disparity but implies similar patterns over the OAE2. We also find that within-genus disparity is slightly lower than among-genus disparity, but not hugely so. The combined results suggest that acanthoceratoid shell anatomy does not really show "genus" level characters, even if congeneric species do tend to be more similar to each other than to species in other genera. Thus, this might provide more of a warning for other types of studies using anatomical data (e.g., phylogenetic studies) than for disparity studies. Non-technical Summary: Many paleobiologists use genera to examine scientific questions.
This leads to questions over whether this broader approach misses important species-level patterns. This study uses acanthoceratid ammonoids from the Late Cretaceous to examine disparity patterns at both the genus level and the species level. We specifically examine the disparity at both levels over a time of high stress for this group, Ocean Anoxic Event 2 (OAE2). Our results show that genus-level disparity slightly exaggerates early acanthoceratid disparity but falls to a pattern similar to species-level disparity during OAE2. Within-genus disparity is shown to be slightly lower than among-genus disparity, but not enough to be startling. Together, these results indicate that while species within the same genus tend to be more similar to each other than to those in other genera, there isn't a set of true "genus" level characters. This outcome leads to a warning against using anatomical data in phylogenetic studies, but less so for disparity studies.
Smith, M. L.; Moshier, S.; Shoobs, N. F.
The temperate rainforests of the Pacific Northwest of North America harbor many endemic taxa whose evolutionary histories have been shaped by major climatic and geologic events. The enigmatic taildropper slugs (genus Prophysaon) are one example, notable for their ability to autotomize their tails to escape predators. Despite extensive work uncovering the evolutionary history of individual lineages, relationships among the nine recognized species of Prophysaon remain poorly understood due to insufficient molecular data. To address this, we collected transcriptomes for six of the nine currently accepted species of Prophysaon. Using these data, we were able to resolve species relationships, calling into question the existing subgeneric classification based on morphology. We also detected undescribed phenotypic diversity within the P. andersonii--P. foliolatum species complex, with molecular data supporting the distinctness of two phenotypically distinct populations from Washington. Finally, our transcriptomic data suggest a moderate role of introgression in shaping the evolutionary history of Prophysaon. Here, we synonymize the subgenus Mimetarion with nominotypical Prophysaon. Future work should further investigate whether the undescribed diversity detected here represents species-level differentiation.
Vasylenko, L.; Livnat, A.
At the fundamental conceptual level, two alternatives have traditionally been considered for how mutations arise and how evolution happens: 1) random mutation and natural selection, and 2) Lamarckism. Recently, the theory of Interaction-based Evolution (IBE) has been proposed, according to which mutations are neither random nor Lamarckian, but are influenced by information accumulating internally in the genome over generations. Based on the estimation-of-distribution algorithms framework, we present a simulation model that demonstrates nonrandom, non-Lamarckian mutation concretely while capturing indirectly several aspects of IBE: selection, recombination, and nonrandom, non-Lamarckian mutation interact in a complementary fashion; evolution is driven by the interaction of parsimony and fit; and random bits do not directly encode improvement but enable generalization by the manner in which they connect with the rest of the evolutionary process. Connections are drawn to Darwin's observations that changed conditions increase the rate of production of heritable variation; to the causes of bell-shaped distributions of traits and how these distributions respond to selection; and to computational learning theory, where analogizing evolution to learning in accord with IBE casts individuals as examples and places the learned hypothesis at the population level. The model highlights the importance of incorporating internal integration of information through heritable change in both evolutionary theory and evolutionary computation.
Villa-Machio, I.; Masa-Iranzo, I.; Nürk, N. M.; Pokorny, L.; Meseguer, A. S.
The combination of target capture sequencing (TCS) with low-coverage whole genome sequencing (lcWGS), an approach known as Hyb-Seq, has allowed the integration of natural history collections into the genomics revolution, transforming biodiversity research. To implement Hyb-Seq, a collection of genomic targets, often nuclear orthologs, is needed to design probes for TCS. In flowering plants, the universal Angiosperms353 probe set has proven able to resolve relationships at multiple evolutionary scales, with caveats. Malpighiales is known to be one of the most challenging flowering plant orders to resolve. Within this order, the clusioid clade (~2.2K species, 94 genera, five families) is no exception. To resolve phylogenetic relationships in this recalcitrant clade, we design a custom probe set, the Clusioids626 kit, composed of 39,936 120-mer probes targeting 626 nuclear orthologs (~6.6M nucleotides). This probe set includes all Angiosperms353 targets and 273 clusioid-specific ones, carefully chosen taking copy number, length evenness, and phylo-informativeness into account. We test our probe set on 70 accessions representing all families and tribes in the clusioid clade. On average, 50.4% of TCS reads mapped to our targets, recovering a median of ~600 orthologs. Relationships for all clusioid families are fully resolved for our nuclear targets. Additionally, 105 plastid coding DNA sequences were retrieved from the lcWGS fraction. A strong cyto-nuclear conflict was detected. The Clusioids626 kit performs better than the universal Angiosperms353 enrichment panel alone. Our kit design workflow can be extended into other lineages for which a universal probe set exists but more resolution is needed.
Mah, J. C.; Lohmueller, K. E.
Accurate estimation of population demographic history is central to population genetics yet remains challenging due to the sensitivity of inference methods to the number of individuals and the demographic scenario assumed in inference. The site-frequency spectrum (SFS) of neutral variants, a widely used summary statistic of genetic variation, is particularly sensitive to demographic processes, but studies have shown that qualitative results from demographic inference, i.e., population expansion vs. contraction, can depend strongly on the number of individuals in the dataset. Here, we analyzed two simulated datasets and one empirical dataset characterized by an ancient population bottleneck followed by a recent population expansion. Fitting a two-epoch demographic model across a range of sample sizes, we found that inference shifted from signals of ancient population contraction at small sample sizes to signals of recent population expansion at large sample sizes. Other summary statistics, including Tajima's D and the proportion of singletons, also changed with sample size. We found that these changes of inferred evolutionary signals under a two-epoch model can be explained by the epoch which contributes the highest mean proportion of coalescent branch lengths. Our results highlight that demographic inference depends critically on the number of individuals analyzed and suggest that analyzing datasets at multiple sample sizes can reveal complementary aspects of population history.
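The two summary statistics named in this abstract can both be computed from an unfolded SFS. The sketch below uses the standard Tajima (1989) formula. The abstract does not say how the authors computed these statistics, so this is only an illustration. The input convention (entry i-1 holds the number of sites whose derived allele appears in i of n sampled chromosomes) and the function names are assumptions.

```python
from math import sqrt


def singleton_proportion(sfs):
    """Fraction of segregating sites whose derived allele appears exactly once."""
    return sfs[0] / sum(sfs)


def tajimas_d(sfs):
    """Tajima's D from an unfolded SFS with n - 1 entries (n = sample size).

    Requires at least one segregating site; otherwise the variance term is zero.
    """
    n = len(sfs) + 1
    s = sum(sfs)  # number of segregating sites
    # Mean pairwise differences: a site at frequency i/n separates i*(n-i) pairs.
    pi = sum(x * i * (n - i) for i, x in enumerate(sfs, start=1)) / (n * (n - 1) / 2)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy spectra for n = 10 chromosomes (made-up counts).
excess_rare = [20, 0, 0, 0, 0, 0, 0, 0, 0]      # all singletons
excess_common = [0, 0, 0, 0, 20, 0, 0, 0, 0]    # all intermediate frequency
d_rare = tajimas_d(excess_rare)
d_common = tajimas_d(excess_common)
```

An excess of rare variants gives a negative D, the classic expansion signal, while an excess of intermediate-frequency variants gives a positive D. Recomputing both statistics on subsamples of different sizes is one simple way to look for the sample-size dependence the abstract describes.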